Let $(Y, X_1, \ldots, X_m)$ be a random vector. It is desired to predict $Y$
based on $(X_1, \ldots, X_m)$. Examples of prediction methods are regression,
classification using logistic regression or separating hyperplanes,
and so on.

We consider the problem of best subset selection, and study it in
the context $m = n^\alpha$, $\alpha > 1$, where $n$ is the number of observations. We
investigate procedures that are based on empirical risk minimization.
It is shown that, in common cases, we should aim to find the best
subset among those whose size is of order $o(n/\log(n))$. It is also
shown that, in some "asymptotic sense," when assuming a certain
sparsity condition, there is no loss in letting $m$ be much larger than
$n$, for example, $m = n^\alpha$, $\alpha > 1$. This is in comparison to starting with
the "best" subset of size smaller than $n$ and regardless of the value
of $\alpha$.

We then study conditions under which empirical risk minimization
subject to an $l_1$ constraint yields nearly the best subset. These results
extend some recent results obtained by Greenshtein and Ritov.

Finally we present a high-dimensional simulation study of a "boost- 
ing type" classification procedure. 

1. Introduction and preliminaries. Let $Z^i = (Y^i, X^i_1, \ldots, X^i_m)$, $i = 1, \ldots, n$,
be i.i.d. vectors, $Z^i \sim F$, where $F$ is unknown. It is desired to find a good
predictor for $Y$ given $X_1, \ldots, X_m$, based on the observations $Z^i$, $i = 1, \ldots, n$.
In this paper we consider high-dimensional learning problems, where the
objective is to select a good predictor from a large class, based on minimizing
an empirical risk. We concentrate on the case where the dimension is much
larger than the number of observations, that is, $m \gg n$.

There are three main goals of this paper. One is to advocate the practice of
turning to high dimensions of explanatory variables for the purpose of finding
good predictors. Another is to give a perspective on the phenomenon of "not
getting overfit" when applying high-dimensional procedures, as discussed
in [2]. We will suggest that often such procedures may be viewed as
(suboptimal) optimization methods for finding the empirically best subset of
explanatory variables. A final goal is to show that often optimization under
an $l_1$ constraint (as in the "Lasso") could be a helpful and computationally feasible
method for finding good predictors in high dimensions.

We now describe a few examples where an analysis with $m \gg n$ is
conducted. In microarray experiments the explanatory variables are
measurements describing the activity of certain $m$ genes in $n$ subjects, while the response
could be survival time or an indicator of the event that the subject has a
certain disease, and so on; see [21]. Under the current technology, a typical
microarray experiment involves thousands of genes, that is, the dimension
$m$ is of the order of thousands, while $n$ is of the order of hundreds or less.

In [25], page 496, the following pattern recognition example is described. 
It is desired to train a machine to identify handwritten digits for the purpose 
of recognizing handwritten zip codes. The raw data given to the machine 
comes from 256 pixels, that is, the raw data is made up of 256 variables. 
Yet, for their classification method, they considered all interactions up to
order 7. This creates $m \approx 10^{16}$ explanatory variables constructed from the
initial set of 256. The amount of data (or training set) they were using was
$n = 7291$.

Finally, consider the following example as a plausible data mining
application of analysis with $m \gg n$. An insurance company is interested in
estimating the probability of a claim, due to a car accident, by various cus- 
tomers. We may define for each customer quite a few categorical variables 
based on age, sex, car make, car model, marital status, address, and so on. 
Considering also third- or fourth-order interactions of these categorical vari- 
ables, one does not need a lot of imagination to come up with tens and 
hundreds of millions of categorical explanatory variables. Of course, the in- 
surance company might have access to a big historical database, so n may 
also be very large. 

Although our motivation is to understand the problem where $m \gg n$,
there are also implications for the following more classical problem when
$m < n$. Informally the problem may be stated as follows: how many
observations, $n$, do we need in order to accurately estimate $m$ parameters? Our
asymptotic approach suggests that in many cases the condition $m \log(m) = o(n)$
suffices. See further discussion at the end of this section.

We will consider and formulate our problem in various degrees of generality.
The ideas are easier to introduce and motivate through the problem of
best subset selection in regression, but will be carried out in a more general
context.

Let $Z = (Y, X_1, \ldots, X_m)$ be a random vector, $Z \sim F$, with $F$ unknown. Consider
first the problem of selecting a linear predictor for $Y$ based on $X_1, \ldots, X_m$,
that is, a function of the form $\sum_j \beta_j X_j$. We identify a predictor with the
vector $\beta = (\beta_1, \ldots, \beta_m)$. Its performance is evaluated based on

(1) $L_F(\beta) = E_F\bigl(Y - \sum_j \beta_j X_j\bigr)^2$.

The selection of a predictor is based on a sample of i.i.d. observations $Z^i$,
$i = 1, \ldots, n$. In practice, as the sample size $n$ increases, we might want to
consider more complicated models or linear predictors, that is, increase the
number $m$ of explanatory variables. Thus, a worthwhile asymptotic study is
of a triangular array form, where we are given $n$ i.i.d. observations $Z^1_n, \ldots, Z^n_n$
at stage $n$, $Z^i_n \sim F_n$, where $F_n$ is unknown, $F_n \in \mathcal{F}_n$. In order to simplify notation,
we will drop the index $n$ of the triangular array and write $Z^i$; thus, at
stage $n$, $Z^i = (Y^i, X^i_1, \ldots, X^i_{m(n)})$. Here $m = m(n)$ is the number of
explanatory variables, which depends on $n$ and typically grows with $n$. We will
study asymptotics where $m = n^\alpha$, $\alpha > 1$. See further discussion on the
triangular array setup in [13]. Further papers investigating a similar regression
triangular array structure are [16, 17, 19]. These papers also study the Lasso
and regularization via $l_1$ constraints, as we do in this paper. A recent paper
that studies the virtue of letting $m$ be much larger than $n$ in classification
problems is [1].

The above regression setup motivates us to generalize as follows. Consider
a triangular array as before, equipped with an abstract triangular structure
of parametrized predictors, that is, at stage $n$, a collection of functions

$B^n = \{g_\beta\}$,

where $g_\beta = g_\beta(X_1, \ldots, X_{m(n)})$ and the parametrization is Euclidean.

Consider a general nonnegative prediction loss $l$, incurred for predicting
$g_\beta(X_1, \ldots, X_{m(n)})$ when the outcome is $Y$,

$l = l(Y, g_\beta(X_1, \ldots, X_{m(n)}))$.

To simplify notation, we will abuse notation and write

$l(\beta, Z) \equiv l(Y, g_\beta(X_1, \ldots, X_{m(n)}))$.

As in equation (1), we define

(2) $L_F(\beta) = E_F\, l(\beta, Z)$.

Note that in equation (1) we used a squared loss $l$. As an additional example,
consider classification where $Y$ may be either $+1$ or $-1$, the predictors are
of the type $g_\beta(X_1, \ldots, X_m) = \mathrm{sign}(\sum_j \beta_j X_j)$, and the prediction loss is 0-1.

In the current more abstract formulation, we will consider entry $j$ of the
parameter $\beta$ "active" if $\beta_j \neq 0$. Note that, in order to relate to regression and
other important examples, we denote both the dimension of the explanatory
variables and of the parameter space by $m$. However, in the abstract
formulation the dimension of the explanatory variables is actually not relevant.
In the sequel, assumptions made about $m = m(n)$ may in fact be assumed
only on the dimension of the parameter space.

Let

$\beta^*_{F_n} = \arg\min_{\beta \in B^n} L_{F_n}(\beta)$.

From now on, when we say triangular array, we mean a sequence of
collections of distributions $\mathcal{F}_n$, a sequence of collections of predictors $B^n$ which
are available at stage $n$, $n = 1, 2, \ldots$, and a prediction loss function $l$.

We will study sequences of procedures $\hat\beta = \hat\beta(Z^1, \ldots, Z^n)$ that select a
predictor $\hat\beta \in B^n$, based on the observations $Z^1, \ldots, Z^n$. Here $Z^i$ are i.i.d.
distributed $F_n$, $F_n \in \mathcal{F}_n$. The dependence of $\hat\beta = \hat\beta_n$ on $n$ is often suppressed,
and we will loosely say the procedure $\hat\beta$.

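Definition 1. Given a triangular array, the sequence of procedures $\hat\beta_n$
is persistent with respect to $B^n$ if, for every $\varepsilon > 0$,

(3) $\sup_{F_n \in \mathcal{F}_n} P_{F_n}\bigl(L_{F_n}(\hat\beta_n) - L_{F_n}(\beta^*_{F_n}) > \varepsilon\bigr) \to 0$.

It is not difficult to see that the above is equivalent to the following: for
any sequence $F_n \in \mathcal{F}_n$,

$L_{F_n}(\hat\beta_n) - L_{F_n}(\beta^*_{F_n}) \xrightarrow{P} 0$.

Here the distribution of $\hat\beta_n$ is determined by $F_n$.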

Remark 1. (a) The concept of persistence is close to that of consistency.
Yet, in consistency there is a certain, usually "true," fixed parameter to
which a consistent estimator converges. In our setup the analog of the true
parameter is $\beta^*_{F_n}$, which changes with $n$. Also, in consistency convergence is
usually in terms of the Euclidean distance between the true parameter and
its estimator, while in persistence the distance is tied to the loss.

(b) Consider the triangular array structure that motivates us, where as
$n$ grows we consider larger nested collections of predictors $B^n$. In such a
nested structure we may consider the joint distribution $F^\infty$ of all variables,
that is, the joint distribution of $(Y, X_1, \ldots, X_{m(n)}, \ldots)$. Let $F^0_n$ be the marginal
of $F^\infty$ on $\sigma(Y, X_1, \ldots, X_{m(n)})$. Obviously $L_{F^0_n}(\beta^*_{F^0_n})$ is monotone decreasing
since $B^n \subset B^{n+1}$. Thus, there is a limit

$\lim_{n \to \infty} L_{F^0_n}(\beta^*_{F^0_n}) = r(F^\infty)$.

When $r(F^\infty) > 0$, the persistence criterion should have appeal. In situations
where $r(F^\infty) = 0$, other criteria should be studied and rates of convergence
become relevant, rather than only persistence.

Under mild conditions, existence of a persistent procedure will follow if
$\beta^*_{F_n}$, the best predictor in $B^n$, has $k_n = o(n/\log(n))$ nonzero entries, also
termed an $o(n/\log(n))$ sparsity rate. This may be shown by a simple entropy
derivation; see Theorem 1 of the next section. It will also be demonstrated in
Section 2 that if, for the relevant sequence $B^n$, the corresponding sequence
$\beta^*_{F_n}$ has an $o(n/\log(n))$ sparsity rate, then there is only a mild effect on the
ability to find a predictor which is nearly as good as $\beta^*_{F_n}$ when increasing
the dimension $m$ dramatically.

Discussion of the asymptotics and the sets $B^n$. We now further discuss
our notion of persistence with respect to sets $B^n$. The discussion is in light
of the regression setup with $m \gg n$. Usually in asymptotics we evaluate
procedures by comparing their estimates (or selected predictors) to the "true"
parameter or the absolutely best predictor. By absolutely best, we mean the
best predictor among those that are linear in $X_1, \ldots, X_m$, rather than the
best within a confined subset $B^n$. The goal is to do nearly as well as the
absolutely best predictor. In regression when $m \gg n$ there is no hope, in
general, to do as well as the absolutely best linear predictor. A natural approach
is to confine ourselves to various subsets $B^n$ of the set of all predictors linear
in $X_1, \ldots, X_m$, for example, the sets $B^n = A(k)$, where $A(k)$ denotes the set
of all the linear predictors which are functions of only $k = k(n)$, $k < m$,
explanatory variables. Then we should try to find a predictor which is nearly
as good as the corresponding $\beta^*_{F_n}$. Of course, the larger $B^n$, the more
challenging this task is. Yet, for sets $B^n$ that are too large, that task is impossible
for reasons explained later using entropy.

It turns out that a sufficient condition for the existence of a persistent
sequence of predictors with respect to $B^n$ is that the corresponding sequence
$\beta^*_{F_n}$ has a sparsity rate $k(n) = o(n/\log(n))$. Note that the last condition on the
sparsity rate is trivially satisfied for the sets $B^n = A(k)$, where
$k = k(n) = o(n/\log(n))$; hence, our further development is always meaningful for such
sets $B^n$.

Our phrasing is slightly different from that of Friedman et al. [11], who
write "Use a procedure that does well in sparse problems, since no procedure
does well in dense problems." The slight difference in our point of view is
that we consider a procedure as doing well when it does well relative to a
collection $B^n$ of predictors from which it is feasible to discover nearly the
best predictor, with the given sample size. We do not care (since we cannot
do much about it) whether the absolutely best predictor is indeed in $B^n$ or not.
We certainly do not assume that the problem is sparse, that is, that the
absolutely best predictor is sparse.

To summarize, we set reasonably high, yet realistic, standards for our
procedures, rather than the highest but often impossible to achieve standards.

In Section 2 the procedures achieving persistence will be of the type of best
subset selection. More precisely, these procedures search for the empirically
best predictor among those in the set $B^n = A(k)$, $k = k(n) = o(n/\log(n))$.
Their algorithmic complexity makes such procedures impractical. In
Section 3 persistent procedures with lower algorithmic complexity will be
introduced for problems with the following intermediate level of generalization.
We will consider cases where the function $g_\beta(X_1, \ldots, X_m)$ may be presented
as $\rho(\sum_j \beta_j X_j)$. We will then show that, for those intermediate-level
setups, Lasso-type procedures are often useful. By Lasso-type
procedures, we mean minimization of $L_{\hat F}(\beta)$ subject to a constraint on the $l_1$
norm of $\beta$. Here $\hat F$ is the empirical distribution based on the data $Z^1, \ldots, Z^n$
and

$L_{\hat F}(\beta) = \frac{1}{n} \sum_{i=1}^n l(\beta, Z^i)$.
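As a concrete reading of the empirical risk $L_{\hat F}(\beta)$, that is, the sample average of the loss over the $n$ observations, the following minimal Python sketch evaluates it for the two losses used as examples earlier: squared loss with a linear predictor and 0-1 loss with a sign predictor. The function names and the toy data are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def squared_loss(y, pred):
    # Squared prediction loss, as in equation (1).
    return (y - pred) ** 2

def zero_one_loss(y, pred):
    # 0-1 loss: 1 when the predicted label differs from the outcome.
    return (y != pred).astype(float)

def linear_predictor(X, beta):
    # g_beta(x) = sum_j beta_j x_j
    return X @ beta

def sign_predictor(X, beta):
    # g_beta(x) = sign(sum_j beta_j x_j); ties at zero are sent to +1
    # (an arbitrary convention chosen for this sketch).
    return np.where(X @ beta >= 0, 1.0, -1.0)

def empirical_risk(beta, X, y, loss, predictor):
    # Average of loss(y_i, g_beta(x_i)) over the n observations.
    return float(np.mean(loss(y, predictor(X, beta))))

# Toy data: n = 4 observations, m = 2 explanatory variables (hypothetical).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
beta = np.array([1.0, -1.0])
y_reg = np.array([1.0, -1.0, 0.5, 2.0])    # regression outcomes
y_cls = np.array([1.0, -1.0, -1.0, 1.0])   # labels in {-1, +1}

risk_sq = empirical_risk(beta, X, y_reg, squared_loss, linear_predictor)
risk_01 = empirical_risk(beta, X, y_cls, zero_one_loss, sign_predictor)
print(risk_sq, risk_01)
```

Keeping the loss and the predictor as separate arguments mirrors the abstract notation $l(\beta, Z) \equiv l(Y, g_\beta(X_1, \ldots, X_m))$, so other loss/predictor pairs can be plugged in without changing `empirical_risk`.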

Finally, in Section 4 a simulation study in high dimensions is presented
for a classification method tied to boosting. The simulated classification
method involves optimization under an $l_1$ constraint.

The case where $m < n$. Our formulation and problems are meaningful
also in the case $m < n$. Consider regression again. Let $B^n$ be, as is customary,
the set of all linear functions of $X_1, \ldots, X_m$. Then $\beta^*_{F_n}$ is the absolutely
best predictor. A related problem in a triangular array formulation was
studied by Huber [15], Yohai and Maronna [26] and Portnoy [23]; see further
references there. In their setup it is desired to estimate the coefficients in a
regression problem, where the number of explanatory variables is increased
with the number of observations. Under their model, where it is assumed
that $Y = \sum_j \beta_j X_j + \varepsilon$, $E\varepsilon = 0$, the error is not (necessarily) normal and
may have heavy tails; also, the explanatory variables are nonrandom. They
study consistency in terms of the $l_2$ distance between the estimate and the true
parameter. The results by Huber and by Yohai and Maronna suggest that
a sufficient condition for consistency is that the rate at which $m$ increases with
$n$ is $m = o(\sqrt{n})$. Note that when assuming finite variance for $\varepsilon$, and that
the minimal eigenvalue of the design matrix is of order $O(n)$ (as in the
case where the columns are orthogonal and the entries are of order 1), a
rate $m = o(n)$ is possible using the least squares estimator. However, their
interest was mainly in situations involving heavy tails where the variance is
not finite.

Portnoy [23] showed that, under natural assumptions, we may let $m$ grow
much faster and allow a rate of $m = o(n/\log(n))$. Notice the huge gap
compared to the previously mentioned rate of $o(\sqrt{n})$. We will also show that the
rate suggested by Portnoy should imply persistence in many cases. Yet, we
are also left with a similar huge gap; see Remark 4 in Section 2.

2. Sparsity and persistence. In this section we will give conditions on
triangular arrays under which there exists a procedure satisfying (3).

The following condition will be assumed on the prediction loss $l$ given a
triangular array.

Condition 1. For every $\varepsilon$, there exists $M(\varepsilon)$ such that, for large enough
$n$, if $L_{F_n}(\beta) > L_{F_n}(\beta^*_{F_n}) + 2\varepsilon$, then the truncated random variable
$T_\beta = \min(l(\beta, Z), M(\varepsilon))$ satisfies

$E_{F_n} T_\beta > L_{F_n}(\beta^*_{F_n}) + \varepsilon$.

Note that Condition 1 is obviously satisfied for a bounded prediction loss
$l$. In fact, under Condition 1 we may later assume w.l.o.g. that $l(\beta, Z)$ is
bounded uniformly under all the distributions in $\mathcal{F}_n$. This will enable us to
apply large deviations principles to the fluctuations of $L_{\hat F}(\beta)$ from its mean
$L_{F_n}(\beta)$.

The following easily proved theorem, Theorem 1, is stated for a general
triangular array setup. It is a key theorem for understanding why, for very
general triangular array setups, a predictor should be searched for among the
set $A(k_n)$ of predictors whose parameters have at most
$k_n = o(n/\log(n))$ active entries. In Theorem 6 of [13], it is shown in a regression
setup that this rate cannot be improved, that is, an example is given where
the sparsity rate is $k_n = O(n/\log(n))$, in which there exists no persistent
procedure (of any kind!).

The idea of that proof applies to more general situations, as treated in
the current paper. Thus, it seems that, for quite general triangular arrays,
when $m = n^\alpha$, $\alpha > 1$, the rate $k_n = o(n/\log(n))$ is also an upper bound for
achieving persistence.

$\varepsilon$-entropy. We will use the concept of $\varepsilon$-entropy of a set of predictors
indexed by $\beta$, $\beta \in B$, given a collection of distributions $\mathcal{F}$ and a prediction
loss $l$. It is defined as $\varepsilon$-entropy $= \log(N)$, where $N$ is the minimal
number of points, denoted $\beta^1, \ldots, \beta^N$, such that for each $\beta \in B$ there
exists a point $\beta^j$ with $|L_F(\beta^j) - L_F(\beta)| < \varepsilon$ for every $F \in \mathcal{F}$. A set of
such $N$ points will be called an $\varepsilon$-grid.

Note that, given any $F \in \mathcal{F}$, in order to select a predictor whose performance
is within $\varepsilon$ of the optimal predictor $\beta^*_F$, it is enough to select the best among
an $\varepsilon$-grid of points.

Remark 2. In order to prove the existence of a persistent procedure
with respect to a sequence $B^n$, it is enough to show the existence of a
sequence of procedures satisfying (3) for every fixed $\varepsilon$. Then, a diagonalization
argument implies the existence of a persistent procedure. Hence, in the
following and throughout we will concentrate on showing, for any $\varepsilon > 0$, the
existence of a procedure (depending on $\varepsilon$) satisfying (3).

Theorem 1. Given a triangular array satisfying Condition 1, assume
the following:

(i) For every sequence $F_n$, $n = 1, 2, \ldots$, the parameter $\beta^*_{F_n}$ belongs to a
$k_n = o(n/\log(n))$-dimensional cube centered at the origin, with Euclidean
volume $R_n$, where $\log(R_n) = o(n)$. [Note, in particular, that the implied sparsity
rate is $o(n/\log(n))$.]

(ii) The functions $L_{F_n}(\beta)$ satisfy the following Lipschitz condition: for
any $\varepsilon > 0$, there exist $\delta > 0$ and $\gamma > 0$ such that if $\|\beta - \beta'\|_2 < \delta n^{-\gamma}$, $\beta, \beta' \in A(k_n)$,
then $|L_{F_n}(\beta) - L_{F_n}(\beta')| < \varepsilon$ uniformly in $F_n \in \mathcal{F}_n$.

Then, for every $\varepsilon > 0$ there exists a sequence of procedures satisfying (3),
and hence there exists a persistent procedure.

Convention. Throughout, we require conditions to hold at $\beta^*_{F_n}$ or at
$\hat\beta = \arg\min_{\beta \in B^n} L_{\hat F}(\beta)$. When these points are not unique, such a condition
should be understood as being satisfied if it holds for one of the relevant
points.

Proof of Theorem 1. The proof is based on a simple entropy
calculation. There are fewer than $m^{k_n}$ subsets of coordinates of size $k_n$. For
each such subset, consider all the predictors determined by active
parameters in this subset. For any $F_n \in \mathcal{F}_n$, the function $L_{F_n}(\beta)$, confined to
this subset, is viewed as a $k_n$-dimensional function. Divide the corresponding
$k_n$-dimensional cube into disjoint small cubes with edges of length
$\delta/(\sqrt{k_n}\, n^\gamma)$. Thus, each point in the cube is within Euclidean distance $\delta n^{-\gamma}$
of the center of one of the small cubes; in particular, this is true for the
point $\beta^*_{F_n}$. These centers determine an $\varepsilon$-grid with respect to the confined
versions of $L_{F_n}(\beta)$, $F_n \in \mathcal{F}_n$, given a specific subset and a corresponding
$k_n$-dimensional cube; this follows from the Lipschitz condition (ii). The
cardinality of the defined $\varepsilon$-grid is

$R_n / [\delta/(\sqrt{k_n}\, n^\gamma)]^{k_n} = \exp\bigl(\log(R_n) + [\log(1/\delta) + \log(\sqrt{k_n}) + \gamma \log(n)]\, k_n\bigr) \equiv A_n$.

There are fewer than $B_n = m^{k_n} = \exp(\alpha \log(n)\, k_n)$
such subsets, so altogether, the number of points needed to construct an
$\varepsilon$-grid with respect to the set of all predictors containing only points $\beta$ with
at most $k_n$ nonzero coordinates and belonging to a cube as in (i) is less
than $N = A_n \times B_n$. Now, $\log(A_n \times B_n)$ is of order $o(n)$ if $k_n = o(n/\log(n))$.
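To get a feel for the entropy count in this proof, the sketch below evaluates $\log(A_n \times B_n)/n$ for increasing $n$. The specific choices $k_n = n/\log(n)^2$, $\log(R_n) = \sqrt{n}$, $\alpha = 2$, $\gamma = 1$ and $\delta = 0.1$ are hypothetical values picked only because they satisfy the stated rates; they are not taken from the paper. Under these assumptions the ratio decreases toward zero, although only at a logarithmic pace.

```python
import math

def log_grid_size(n, alpha=2.0, gamma=1.0, delta=0.1):
    """log(A_n * B_n) from the entropy calculation, with m = n**alpha.

    All default values below are illustrative assumptions.
    """
    # Hypothetical sparsity rate k_n = n / log(n)^2, which is o(n / log n).
    k = max(1, int(n / math.log(n) ** 2))
    # Hypothetical cube volume with log(R_n) = sqrt(n) = o(n).
    log_R = math.sqrt(n)
    # log A_n = log R_n + [log(1/delta) + log(sqrt(k_n)) + gamma log n] k_n
    log_A = log_R + (math.log(1.0 / delta) + 0.5 * math.log(k)
                     + gamma * math.log(n)) * k
    # log B_n = alpha log(n) k_n, the count of coordinate subsets
    log_B = alpha * math.log(n) * k
    return log_A + log_B

sizes = (10**3, 10**5, 10**7, 10**9)
ratios = [log_grid_size(n) / n for n in sizes]
for n, r in zip(sizes, ratios):
    print(f"n = {n:>10d}   log(N)/n = {r:.3f}")
```

The dominant term is $(\alpha + \gamma)\log(n)\,k_n$, which is why a sparsity rate of exactly order $n/\log(n)$ would leave $\log(N)/n$ bounded away from zero.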

It is now standard to show that selecting $\arg\min L_{\hat F}(\beta)$, where the
minimization is over an $\frac{\varepsilon}{2}$-grid, will yield a procedure that satisfies (3). The reason
is as follows: by Condition 1, we may, w.l.o.g., assume that $l$ is bounded and
thus, we may conclude exponential rates of convergence to zero of
probabilities of large deviations (see, e.g., Hoeffding's inequality [25], page 185). Let
$C_n$ be the $\frac{\varepsilon}{2}$-grid of points. Since $\log(N)$ is of order $o(n)$, where $N$ is the
cardinality of $C_n$, we obtain, by applying large deviation exponential rates
coupled with Bonferroni, $P_{F_n}\bigl(\sup_{\beta \in C_n} |L_{F_n}(\beta) - L_{\hat F_n}(\beta)| > \varepsilon\bigr) \to 0$. The result
now follows. □
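The large-deviation step can be made concrete with Hoeffding's inequality for a loss bounded in $[0, M]$, combined with a union (Bonferroni) bound over the $N$ grid points: the probability that some grid point has empirical risk more than $\varepsilon$ away from its mean is at most $2N\exp(-2n\varepsilon^2/M^2)$. The sketch below evaluates the logarithm of this bound. The values $M = 1$, $\varepsilon = 0.2$ and $\log(N) = n/\log(n)$ are assumptions chosen for illustration only.

```python
import math

def log_union_bound(n, log_N, eps, M):
    """Logarithm of 2 * N * exp(-2 * n * eps**2 / M**2).

    Working on the log scale avoids overflow, since N itself can be
    astronomically large.
    """
    return math.log(2.0) + log_N - 2.0 * n * eps**2 / M**2

eps, M = 0.2, 1.0  # hypothetical accuracy and loss bound
log_bounds = {}
for n in (10**3, 10**4, 10**5, 10**6):
    log_N = n / math.log(n)  # hypothetical grid size with log(N) = o(n)
    log_bounds[n] = log_union_bound(n, log_N, eps, M)
    print(f"n = {n:>8d}   log(bound) = {log_bounds[n]:.1f}")
```

A positive value means the bound exceeds 1 and says nothing; once $n$ is large enough that $2n\varepsilon^2/M^2$ dominates $\log(N)$, the logarithm becomes strongly negative, which is the sense in which $\log(N) = o(n)$ suffices.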

The conditions of Theorem 1 imply persistence of procedures that are
confined to searching for the empirically best predictor among those in $A(k_n)$.
The search is also restricted to points which are located in a predetermined
cube centered at the origin, where the log of its volume is of order $o(n)$.
The best point is "known" to be in such a cube. The last restriction is very
weak since the volume of the cube may grow fast. Still, the last restriction
could be "mathematically annoying." Under condition (a) of the following
Corollary 1, this restriction may be avoided.

Another issue is that the procedure achieving persistence in the proof of
Theorem 1 searches in a predetermined grid of points. This is again an
artificial restriction. Condition (b) in Corollary 1 requires an analog of the
Lipschitz condition in Theorem 1 to hold for the empirical function $L_{\hat F}(\beta)$.
Then, it may be concluded that the empirical risk minimization procedure,
minimizing over the entire set $A(k_n)$, is persistent, that is, there is no need
to minimize over a predetermined set of grid points.

Corollary 1. Consider a triangular array satisfying Condition 1.
Assume condition (ii) of Theorem 1. Assume further a sparsity rate
$k_n = o(n/\log(n))$. Finally, assume the following:

(a) With probability approaching 1 uniformly for sequences $F_n$,
$\hat\beta = \arg\min_{\beta \in A(k_n)} L_{\hat F}(\beta)$ belongs to a $k_n = o(n/\log(n))$-dimensional cube
centered at $\beta^*_{F_n}$, with Euclidean volume $R_n$, where $\log(R_n) = o(n)$.

(b) With probability approaching 1 uniformly in sequences $F_n$, the random
function $L_{\hat F_n}(\beta)$ satisfies the following Lipschitz condition: for any $\varepsilon > 0$,
there exist $\delta > 0$ and $\gamma > 0$ such that if $\|\beta - \beta'\|_2 < \delta n^{-\gamma}$, then
$|L_{\hat F_n}(\beta) - L_{\hat F_n}(\beta')| < \varepsilon$, $\beta, \beta' \in A(k_n)$.

Then the procedure $\hat\beta = \arg\min_{\beta \in A(k_n)} L_{\hat F}(\beta)$ is persistent.

Proof. Condition (b) implies that minimizing with respect to $A(k_n)$
is asymptotically equivalent to minimizing with respect to a predetermined
(dense enough) grid contained in $A(k_n)$. Similarly, condition (a) implies that
minimizing with respect to $A(k_n)$ is asymptotically equivalent to minimizing
with respect to its intersection with a predetermined cube centered at $\beta^*_{F_n}$.
The conclusion now follows by applying Theorem 1. □

Note that condition (ii) of Theorem 1 often follows from condition (b) of
Corollary 1.

Under condition (b) of Corollary 1 and condition (ii) of Theorem 1, for a
bounded $l$, the following may be proved similarly to the proof of Corollary 1.
Denote by $A(k_n, R_n)$ the union of all $k_n$-dimensional cubes with volume $R_n$
each. Suppose $\log(R_n) = o(n)$ and $k_n = o(n/\log(n))$. Let $\varepsilon > 0$, and let $F_n \in \mathcal{F}_n$
be a sequence of distributions. Then

(4) $P_{F_n}\bigl(\sup_{\beta \in A(k_n, R_n)} |L_{F_n}(\beta) - L_{\hat F_n}(\beta)| > \varepsilon\bigr) \to 0$.

VC dimension. There is another approach to obtain the type of result 
in Corollary 1, that is, avoiding the annoying assumption that the optimal 
predictor is located in a huge cube, and avoiding artificial procedures that 
search for a predictor in a predetermined grid. It is related to the sophisti- 
cated and deep concept of VC dimension. 

A way of showing that selecting the predictor that empirically minimizes
the risk is equivalent to a search on a grid of $N$ points is through the concept
of VC dimension of a class of functions. Using this concept, one may also
bound $N$. These bounds depend only on properties of the class of functions
$l(\beta, z)$, as functions of $z$, and not on the collection of distributions that is
involved.

Consider the collection of functions $l(\beta, z) \equiv l_\beta(z)$, $\beta \in B^n$, as functions
of $z$. Let us confine ourselves to subsets of functions $l_\beta(z)$ parametrized by
$\beta$, whose parameter $\beta$ may have nonzero entries only for certain $k_n$ indices.
Suppose the VC dimension of each such confined subset of functions is of
order $O(k_n)$. Ideas as in Theorem 1 and Corollary 1 imply that the procedure
$\hat\beta = \arg\min_{\beta \in A(k_n)} L_{\hat F}(\beta)$ is persistent when $k_n = o(n/\log(n))$ and a $k_n$
sparsity rate is assumed.

In the following Example 1, we rederive and generalize a result of Green- 
shtein and Ritov [13]. This is by a simple application of Theorem 1 and 
Corollary 1. Unlike here, Greenshtein and Ritov used properties of the min- 
imal eigenvalue of a Wishart matrix to establish their result. 

Example 1. Let $Z^i = (Y^i, X^i_1, \ldots, X^i_m)$, $m = n^\alpha$, $\alpha > 1$, where $Z^i$ are
i.i.d. multivariate normal of dimension $m(n) + 1$ with bounded second
moments for $X_j$ and $Y$ under $\{\mathcal{F}_n\}$. Consider a regression setup, that is, a
squared prediction loss $l$ and the set of linear predictors. Under these
conditions, we will show that, for $B^n = A(k_n)$, where $k_n = o(n/\log(n))$, the
procedure $\hat\beta = \arg\min_{\beta \in A(k_n)} L_{\hat F}(\beta)$ is persistent.

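Now, by appropriate reparametrization and invariance considerations, we
may assume w.l.o.g. that $X_j$, $j = 1, \ldots, m$, are uncorrelated standard
normals; also, w.l.o.g., $Y$ is uncorrelated with the explanatory variables, that
is, $\beta^*_{F_n} = 0$. Let $\mathrm{var}(Y) = \sigma^2$. Then $L_F(\beta) = \|\beta\|_2^2 + \sigma^2$, and hence

(5) $|L_{F_n}(\beta) - L_{F_n}(\beta')| = \bigl|\,\|\beta\|_2^2 - \|\beta'\|_2^2\,\bigr| \le \|\beta - \beta'\|_2\,(\|\beta\|_2 + \|\beta'\|_2)$,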
and condition (ii) of Theorem 1 is satisfied.

In the following we will check conditions (a) and (b) of Corollary 1, in
order to finally apply that corollary.

First, the Lipschitz condition, condition (b) of Corollary 1, is satisfied by
$L_{\hat F_n}(\beta)$ with probability approaching 1. Observe that, for large enough $\gamma'$,
$P(\max(X_1, \ldots, X_m) > n^{\gamma'} \equiv M)$ approaches 0; this follows by combining Chebyshev
and Bonferroni. Thus, with high probability, for $\beta, \beta' \in A(k_n)$,

$|\sum_j \beta_j X_j - \sum_j \beta'_j X_j| \le M \sum_j |\beta_j - \beta'_j| \le M \sqrt{2k_n}\, \|\beta - \beta'\|_2$.

The last inequality is by Cauchy-Schwarz, since $\beta - \beta'$ has at most $2k_n$
nonzero entries. Condition (b) of Corollary 1 follows, for a squared loss, from the
last inequality, when applied similarly to $Z^i$, $i = 1, \ldots, n$.

Condition 1 follows from the multivariate normality. In fact, for the set of
random variables $l(\beta, Z)$ with $\|\beta\|_2 < R$, for some $R < \infty$, we have uniform
integrability and thus, w.l.o.g., the set consists of bounded random variables.

We now turn to condition (a). We will show that, with probability
approaching 1, $\hat\beta = \arg\min_{\beta \in A(k_n)} L_{\hat F_n}(\beta)$ belongs to a ball with radius (say)
$2\sigma^2$, centered at $\beta^*_{F_n} = 0$. Let $G$ be the union of all $k_n$-dimensional balls of
radius $2\sigma^2$. Then by the above and by (4), given any $\varepsilon_0 > 0$,

(6) $P_{F_n}\bigl(\sup_{\beta \in G} |L_{F_n}(\beta) - L_{\hat F_n}(\beta)| > \varepsilon_0\bigr) \to 0$.

Note that since w.l.o.g. $\beta^*_{F_n} = 0$, we have (*) $L_{F_n}(\beta^*_{F_n}) = L_{F_n}(0) = \sigma^2$. For $\beta$
on the boundary of $G$, $\|\beta\|_2^2 = 2\sigma^2$; hence, for such $\beta$ we have
(**) $L_{F_n}(\beta) = 2\sigma^2 + \sigma^2$. Condition (a) now follows by the convexity of $L_{\hat F_n}(\beta)$, from (*),
(**) and (6).

Finally, applying Corollary 1, we obtain that the procedure $\hat\beta$ which selects
the empirically best predictor from the set $A(k_n)$ is persistent.
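The procedure discussed in Example 1, empirical risk minimization over $A(k_n)$ under squared loss, can be sketched as an exhaustive search over coordinate subsets. The code below is a minimal illustration on simulated Gaussian data with tiny, hypothetical dimensions ($n = 40$, $m = 12$, $k = 2$); the function name `best_subset_erm` and the data-generating choices are assumptions made for this sketch, not part of the paper.

```python
from itertools import combinations
from math import comb

import numpy as np

def best_subset_erm(X, y, k):
    """Search all size-k coordinate subsets for the smallest empirical squared loss."""
    n, m = X.shape
    best_risk, best_beta = np.inf, np.zeros(m)
    for subset in combinations(range(m), k):
        cols = list(subset)
        # Least squares restricted to the chosen columns.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        risk = float(np.mean((y - X[:, cols] @ coef) ** 2))
        if risk < best_risk:
            best_risk = risk
            best_beta = np.zeros(m)
            best_beta[cols] = coef
    return best_beta, best_risk

# Simulated data (hypothetical): only coordinates 3 and 7 are active.
rng = np.random.default_rng(0)
n, m, k = 40, 12, 2
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[[3, 7]] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat, risk_hat = best_subset_erm(X, y, k)
support = sorted(np.flatnonzero(beta_hat).tolist())
n_subsets = comb(m, k)  # number of least-squares fits performed
print(support, risk_hat, n_subsets)
```

Searching subsets of size exactly $k$ also covers smaller supports, since a least-squares fit may set a coefficient to zero. The number of fits is $\binom{m}{k}$, which grows far too quickly for $m = n^\alpha$ with large $n$; this is the computational burden that motivates the $l_1$-constrained procedures considered next.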

Remark 3. (i) In the last example we used multivariate normality only
to conclude Condition 1. Hence, the result holds in much more general
situations. A proof along the lines of Example 1 is possible for other prediction
losses, for example, $l(\beta, Z) = |Y - \sum_j \beta_j X_j|$.

Remark 4. Consider a regression case, as in Example 1. Suppose we
replace the multivariate normal assumption by the assumption that the
entries of $Z^i$ are bounded under the possible distributions in the triangular
array. We cannot prove the $o(n/\log(n))$ rate for $k_n$, as in Example 1. The
reason is that Condition 1 is not implied. Note that existence of $M(\varepsilon)$ for every
fixed $n$ is trivially implied by boundedness, but not existence of $M(\varepsilon)$ that
holds uniformly for every $n$. In [13] a sparsity rate of $k_n = o(\sqrt{n/\log(n)})$ is
shown to imply persistence, under the additional assumption that the
minimal eigenvalue of the covariance matrix of $(X_1, \ldots, X_m)$ does not approach
0. Whether we may obtain persistence under higher rates, assuming only
boundedness, is suggested there as a problem. We still do not know the
answer to this problem.

3. Optimization under $l_1$ constraint and the Lasso. In the main result
of this section, Theorem 2, we will show for special classes of parametrized
predictors that we may achieve persistence and approximate the best subset
of a certain size through optimization under an $l_1$ constraint. The special classes
are of the form

(7) $g_\beta(X_1, \ldots, X_m) = \rho\bigl(\sum_j \beta_j X_j\bigr)$.

As a further example, consider the class of predictors

(8) $g_\beta(X_1, \ldots, X_m) = \dfrac{\exp[\sum_j \beta_j X_j]}{1 + \exp[\sum_j \beta_j X_j]}$.

The optimization under the constraint that the number of nonzero
entries of $\beta$ is $k_n$ has high complexity in general. It is desired to replace it by
a constraint that determines a convex feasible set. When the target
function $L_{\hat F}(\beta)$ is also convex, the problem has an algorithmically efficient
solution; see [20].

An example where both the target function and the feasible set are convex
is the Lasso procedure, that is,

(9) $\min_\beta L_{\hat F}(\beta) = \min_\beta \frac{1}{n} \sum_i \bigl(Y^i - \sum_j \beta_j X^i_j\bigr)^2$

subject to the constraint $\|\beta\|_1 \le b$ for a proper $b$. See [24]; also see basis
pursuit in [5]. Recently Efron et al. [8] developed an efficient algorithm,
called least angle regression, to solve the above optimization problem. We
will elaborate on another example involving convex optimization in the next
section.
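Because both the objective in (9) and the $l_1$ ball are convex, even a generic first-order method can solve the constrained problem. The sketch below uses projected gradient descent with a sort-based Euclidean projection onto the $l_1$ ball. This is a swapped-in generic solver chosen for brevity, not the least angle regression algorithm of [8]; the function names, the iteration count and the simulated data are all assumptions of this sketch.

```python
import numpy as np

def project_l1_ball(v, b):
    """Euclidean projection of v onto the set {x : ||x||_1 <= b}, with b > 0."""
    if np.abs(v).sum() <= b:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # magnitudes in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - b)[0][-1]
    theta = (css[rho] - b) / (rho + 1.0)     # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(X, y, b, n_iter=500):
    """Minimize (1/n) * sum_i (y_i - x_i . beta)^2 subject to ||beta||_1 <= b."""
    n, m = X.shape
    # Step 1/L, where L = 2 * sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(m)
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ beta)
        beta = project_l1_ball(beta - step * grad, b)
    return beta

# Simulated data with m > n (hypothetical): three active coordinates.
rng = np.random.default_rng(0)
n, m = 50, 200
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[:3] = [1.0, -2.0, 0.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = np.abs(beta_true).sum()                  # constraint level, assumed known here
beta_hat = constrained_lasso(X, y, b)
risk_hat = float(np.mean((y - X @ beta_hat) ** 2))
risk_zero = float(np.mean(y ** 2))           # empirical risk of beta = 0
print(np.abs(beta_hat).sum(), risk_hat, risk_zero)
```

In practice the level $b$ is not known and has to be chosen; the sketch sets it to $\|\beta\|_1$ of the simulated truth purely for convenience.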

We study the replacement of the constraint on the number of nonzero
entries of $\beta$ by a convex constraint on its $l_1$ norm. In recent papers by
Donoho [6, 7], a general setup is described in which optimization under an
$l_1$ constraint gives the actual optimal solution under the constraint on the
number of nonzero entries. Our ultimate goal is not to find a predictor with a
sparse representation; for us, searching for a sparse solution is only a means
of regularization and of controlling the entropy. Thus, we need weaker results
compared to those of Donoho; for our purpose, it is enough to show some
kind of (weaker) equivalence between the solutions obtained under the two
types of constraints. From the following Lemma 1, it follows that predictors
with parameters that are obtained through optimization under a constraint
on their $l_1$ norm might (appear to) have more than $k_n$ "active entries," but
in fact it will be shown that, keeping the constraint at the right magnitude
(depending on $k_n$), they are equivalent to predictors with parameters that
have only $k_n$ active entries. From the last fact, our main theorem of this
section, Theorem 2, will follow. It is a generalization of a result obtained by
Greenshtein and Ritov [13] for regression.

The following Lemma 1 is given without proof. It is a rephrasing of a 
result by Maurey (see [22]; a version of it may be found in [13], Lemma 
4, and in [16], Proposition 2.2). There, the analogous result is stated for a 
single distribution $G$, but the same proof works for a pair $G_1$ and $G_2$, as in
what follows. 



Lemma 1. Let $G_1$ and $G_2$ be two distributions under which $X_j$, $j =
1, \dots, m$, are bounded by $M$. Let $\beta$ be an $m$-dimensional vector such that
$\|\beta\|_1 = h$. Let $\delta > 0$. Then for every $k > 0$, there exists a corresponding
vector $\beta'$, where $\|\beta'\|_1 = h$, having at most $k$ nonzero coefficients, such that

$P_{G_i}\bigl(|\sum_j \beta_j X_j - \sum_j \beta'_j X_j| > \delta\bigr) \le 4M^2h^2/(\delta^2 k)$, $i = 1, 2$.
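
The lemma is stated without proof. A common way to see why such a sparse vector should exist is an empirical sampling construction: draw $k$ coordinates with probabilities proportional to the absolute values of the coefficients and average the corresponding signed unit vectors, scaled by $h$. The sketch below illustrates that construction only; it is an assumption that this matches the argument in the cited references, and the name `maurey_sparsify` is hypothetical.

```python
import numpy as np

def maurey_sparsify(beta, k, rng):
    """Random k-term approximation of a nonzero vector beta.

    Draws k indices J_1..J_k with P(J = j) = |beta_j| / h, h = ||beta||_1,
    and returns (h / k) * sum_t sign(beta_{J_t}) * e_{J_t}.
    The result has at most k nonzero entries, l1 norm exactly h,
    and expectation equal to beta.
    """
    beta = np.asarray(beta, dtype=float)
    h = np.abs(beta).sum()
    idx = rng.choice(len(beta), size=k, p=np.abs(beta) / h)
    sparse = np.zeros_like(beta)
    # np.add.at accumulates correctly when an index is drawn more than once.
    np.add.at(sparse, idx, (h / k) * np.sign(beta[idx]))
    return sparse
```

Because each draw contributes a term bounded by $hM$ in absolute value, the variance of the approximating linear combination is of order $h^2M^2/k$, which is the source of the $1/k$ rate in the lemma.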

We will confine ourselves to triangular arrays where, for each $n$, the pair
consisting of the prediction loss $l$ and the collection of predictors $\{g_\beta\}$ satisfies
the following:

Condition 2. For a fixed $y$, the function

$h(y, \sum_j \beta_j X_j) \equiv l(Y, g_\beta(X_1, \dots, X_m))$

is bounded and uniformly continuous in $\sum_j \beta_j X_j$, uniformly in $y$.



The boundedness condition on $l$ may be circumvented in various examples.
It may be weakened assuming a condition like Condition 1, or uniform
integrability of $l(\beta, Z)$, $\beta \in B^n$. In Theorem 2 we will also require
boundedness of $X_j$, in order to apply Lemma 1. If this assumption is avoided,
the required sparsity rate in Theorem 2 would be $o(n/(\log(n)\, d_n))$, where
$d_n = \sup_{F_n} E_{F_n}[\max(X_1, \dots, X_m)]^2$. Again, the boundedness assumption on
$X_j$ may be avoided in special cases, like regression with multivariate normal
$Z^i$, as treated in Section 4 of [13]. We will keep the boundedness assumption
for a clearer exposition.

Our main theorem for this section is the following. 



Theorem 2. Consider a triangular array satisfying Condition 2 and
having bounded $X_j$. Suppose the sparsity rate is $k_n = o(n/\log(n))$. Suppose
further that $\|\beta^*_{F_n}\|_2$ is bounded by $R$ (w.l.o.g. $R = 1$) for every $F_n$, $F_n \in \mathcal{F}_n$,
$n = 1, 2, \dots$. Then the following procedure is persistent. Select the predictor
$g_{\hat\beta}$, where

(10) $\hat\beta = \arg\min_\beta L_{\hat F}(\beta)$,

subject to the constraint $\|\beta\|_1 \le \sqrt{k_n}$.



Lemma 2. Assume $X_j$ are bounded by $M$. Then Condition 2 implies
both Lipschitz conditions, that is, condition (ii) in Theorem 1 and condition
(b) in Corollary 1.

Proof. Observe that

(11) $|\sum_j \beta_j X_j - \sum_j \beta^*_j X_j| \le M \sqrt{k_n}\, \|\beta - \beta^*\|_2 \le M n^{0.5} \|\beta - \beta^*\|_2$.

Here we have applied Cauchy-Schwarz and the fact that $k_n \le n$.

The proof follows from the uniform continuity and boundedness of $l$. □

Proof of Theorem 2. Let $\hat\beta$ be the solution of (10) for a data set
coming from $F_n$. Then by Lemma 1, given $\varepsilon_1 > 0$ and $\delta_1 > 0$, for any sequence
$K_n$ such that $k_n = o(K_n)$, there exists a parameter $\hat\beta'$ having at most $K_n$
nonzero entries, such that both for $G_1 = F_n$ and for $G_2 = \hat F_n$ we have

(12) $P_{G_i}\bigl(|\sum_j \hat\beta'_j X_j - \sum_j \hat\beta_j X_j| > \delta_1\bigr) < \varepsilon_1$, $i = 1, 2$.

Indeed, with $h = \sqrt{k_n}$ and $k = K_n$, the bound in Lemma 1 is of order
$k_n/(\delta_1^2 K_n)$, which tends to 0. Moreover,

(13) $\|\hat\beta'\|_1 = \|\hat\beta\|_1 \le \sqrt{k_n}$.

We choose a sequence $K_n$ which is $o(n/\log(n))$, so that (12) is satisfied.
By Condition 2, (12) implies both

(14) $|L_{F_n}(\hat\beta) - L_{F_n}(\hat\beta')| < \varepsilon = \varepsilon(\varepsilon_1, \delta_1)$

and

(15) $|L_{\hat F_n}(\hat\beta) - L_{\hat F_n}(\hat\beta')| < \varepsilon = \varepsilon(\varepsilon_1, \delta_1)$.

Note that we may obtain (14) and (15) for $\varepsilon > 0$ arbitrarily small, by
selecting large enough $K_n = o(n/\log(n))$. By (13) and by construction, $\hat\beta'$
belongs to a $K_n = o(n/\log(n))$-dimensional cube centered at $\beta^*_{F_n}$, where the
logarithm of the cube's volume is $o(n)$. Also, by Lemma 2, both condition
(b) of Corollary 1 and condition (ii) of Theorem 1 are satisfied. Hence, by
(4) we obtain

(16) $P_{F_n}\bigl(|L_{F_n}(\hat\beta') - L_{\hat F_n}(\hat\beta')| > \varepsilon\bigr) \to 0$.

Note, by assumption $\|\beta^*_{F_n}\|_2 \le 1$, whence, by Cauchy-Schwarz, $\|\beta^*_{F_n}\|_1 \le
\sqrt{k_n}$. Thus, by the definition of $\hat\beta$ we have

(17) $L_{\hat F_n}(\hat\beta) \le L_{\hat F_n}(\beta^*_{F_n})$.

Finally, by the law of large numbers we have

(18) $P_{F_n}\bigl(|L_{F_n}(\beta^*_{F_n}) - L_{\hat F_n}(\beta^*_{F_n})| > \varepsilon\bigr) \to 0$.

From (14), (15), (16), (17) and (18), we obtain persistence of $\hat\beta$, and the
proof of the theorem follows. □

As remarked before, in practice the proper value for the constraint is
unknown. One should try various values and test the resulting predictors on
a test set. Our theory suggests that the resulting optimal $l_1$ constraint will
be of order $\sqrt{n/\log(n)}$.

Remark 5. From the proof of Theorem 2, we obtain, even when not
assuming sparsity, an appealing feature of rules based on optimization under
an $l_1$ constraint. The feature is self-consistency of such procedures, in the
following sense.

Suppose $\hat\beta$ is obtained by (10) for $k_n = o(n/\log(n))$. Suppose Conditions 1
and 2 hold. Then for every $\varepsilon > 0$ and every sequence $F_n$,

$P_{F_n}\bigl(|L_{\hat F_n}(\hat\beta) - L_{F_n}(\hat\beta)| > \varepsilon\bigr) \to 0$.

Corollary 2. From the above it follows that, under Condition 2, the
procedure defined by (10) is persistent with respect to $B^n$, the sequence of $l_1$
balls with an $l_1$ radius of order $\sqrt{k_n} = o(\sqrt{n/\log(n)})$.

(There is no need to assume sparsity.)

Discussion. Regularization by general $l_q$ constraints. The $l_1$ constraint
is motivated through a constraint on the number of nonzero parameters,
which may also be represented as an $l_0$ constraint. The advantage of the
$l_1$ constraint relative to other $l_q$ constraints, $q < 1$, is the convexity of the
feasible set. Yet, from Theorem 2, we conclude that we will not gain much by
optimizing via an $l_0$ or $l_q$, $q < 1$, constraint. This is since persistence under
an $o(n/\log(n))$ sparsity rate is already achieved using the $l_1$ constraint, while
the proofs in this paper and the aforementioned Theorem 6 of Greenshtein
and Ritov [13] indicate that, in general, persistence cannot be achieved for
higher rates.









Lack of persistence of ridge regression. Regularization via an $l_q$ constraint
with $q > 1$ will usually lead to nonpersistent procedures, which are also not
self-consistent. Consider, for example, the case $q = 2$ in a regression context
with a squared loss, called ridge regression. Suppose $Z^i$, $i = 1, \dots, n$, are
multivariate normal, and suppose $\beta^*_{F_n} = 0$, that is, the $Y^i$ are not correlated
with the corresponding $m$ explanatory variables. Assume also that $X_j$ are
uncorrelated standard normals. Denote $\sigma^2 = \mathrm{var}(Y)$. Minimizing the
empirical risk subject to a constraint $\sum_j \beta_j^2 \le \delta^2$ will yield (typically) a
solution $\hat\beta$ which is on the boundary of the feasible set when $\delta^2 < \sigma^2$, that is,
the estimate will have an $l_2$ norm $\delta$. This situation remains when $m$ and $n$
approach infinity in a way that $m \gg n$, no matter how small $\delta > 0$ is. Thus,

$L_{F_n}(\hat\beta) = \delta^2 + \sigma^2$ while $L_{F_n}(\beta^*_{F_n}) = L_{F_n}(0) = \sigma^2$.
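
The risk comparison in the ridge example rests on a simple identity: when $Y$ is independent of uncorrelated standard normal explanatory variables, the squared-loss risk of a fixed coefficient vector equals var(Y) plus the squared $l_2$ norm of that vector. The following Monte Carlo sketch checks this numerically for a vector on the sphere of radius `delta`; the function name and the default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def risk_on_sphere(m=20, n_draws=100_000, sigma=1.0, delta=0.5, seed=0):
    """Monte Carlo estimate of E(Y - sum_j beta_j X_j)^2 for ||beta||_2 = delta,
    with Y ~ N(0, sigma^2) independent of X ~ N(0, I_m)."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(m)
    beta *= delta / np.linalg.norm(beta)   # place beta on the sphere of radius delta
    X = rng.standard_normal((n_draws, m))
    Y = sigma * rng.standard_normal(n_draws)
    return float(np.mean((Y - X @ beta) ** 2))
```

With the defaults, the estimate should be close to `sigma**2 + delta**2 = 1.25`, whereas the zero vector attains `sigma**2 = 1`.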

When the regularization is via an $l_1$ constraint, as suggested in this paper,
again the minimizer of the empirical risk, denoted $\hat\beta$, will be on the boundary
of the $l_1$ ball, which is the feasible set. Yet now, when the $l_1$ constraint is
chosen properly to be $o(\sqrt{n/\log(n)})$, the $l_2$ norm of that solution will be of
order $o_p(1)$, hence, $L_{F_n}(\hat\beta) = \sigma^2 + o_p(1)$. This property of the $l_1$ constraint
is a consequence of our Theorem 2.

Further discussion of the constraint regularization method and its
comparison with $l_2$ regularization may be found in [11] and [4].

In general, regularization may be achieved by introducing penalty
functions. For example, using Lagrange multipliers, one may see that the solution
of the optimization problem, under an $l_q$ constraint, is the same as the solution
of the related optimization problem when introducing the penalty function
$\lambda \sum_j |\beta_j|^q$, called $l_q$ penalization, for an appropriate $\lambda$. A study of
regularization using general penalizations was conducted by Fan and Li [9] and
by Fan and Peng [10]. In their setup, the analogue of our prediction loss $l(\cdot)$
is the log-likelihood, but the essence is the same (see some elaboration on
it in [12]). They treat a general class of penalty functions, including the $l_q$
penalties. In particular, for $l_q$ penalization with $q < 1$ and a proper choice
of $\lambda$, they show that a certain oracle optimality is achieved by penalized
maximum likelihood procedures, while for $q = 1$, such optimality does not
seem to be implied (the recommended penalty functions in those papers are
not of an $l_q$ type, but a class of penalty functions called SCAD which
possesses further nice properties). In a sparse setup, an oracle optimality of
procedures means the following. The rate of convergence to the estimated
parameter is the same as the rate that may be achieved when knowing
which are the zero entries of the parameter. These results are obtained also
under a triangular array setup in [10], but when $m(n) \ll n$. In particular,
for $m = o(n^\alpha)$, $\alpha = 1/3, 1/4, 1/5$, under various assumptions and regularity
conditions. These oracle optimality properties are much more delicate and strong
than the persistence suggested by us. Such strong optimality criteria may
be achieved by procedures, due to the slow rate at which the dimension
$m = m(n)$ increases with $n$, in comparison to the rate in our setting.

4. Numerical study. In this section we examine through simulation the
following high-dimensional classification problem. Consider $Z = (Y, X_1, \dots,
X_m)$, where the value of $Y$ is either $-1$ or $+1$. The prediction loss is

(19) $l(\beta, Z) = h(Y, \sum_j \beta_j X_j) = \exp(-Y \sum_j \beta_j X_j)$.

The convex loss (19) is used to motivate the boosting classification
procedure; see, for example, [14], page 305, or [3]. It may also be motivated as
follows. Suppose we classify according to $g_\beta(X_1, \dots, X_m) = \mathrm{sign}(\sum_j \beta_j X_j)$.
Now the value of $\sum_j \beta_j X_j$ is interpreted both through its sign and the
magnitude of its absolute value. The sign determines the classification decision
and the magnitude is interpreted as the "confidence in that decision." That
is why wrong classifications with large magnitude are severely penalized and
vice versa.

Our optimization under an $l_1$ constraint is similar to the approach of Lugosi
and Vayatis [18]. As observed by them, there could be many other interesting
and natural convex prediction losses other than the above; for example, see 
their Example 3. Yet, (19) has attracted a lot of attention recently and we 
elaborate on it. 
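
For concreteness, the exponential loss (19) and the associated sign classifier can be written in a few lines. This is a minimal sketch; the function names `exp_loss` and `classify` and the array layout (rows are observations, labels in {-1, +1}) are assumptions for the example.

```python
import numpy as np

def exp_loss(beta, X, y):
    """Average of exp(-y_i * sum_j beta_j * x_ij) over the rows of X."""
    return float(np.mean(np.exp(-y * (X @ beta))))

def classify(beta, X):
    """Sign of the linear score; its magnitude plays the role of confidence."""
    return np.sign(X @ beta)
```

A confident wrong decision (large score with the wrong sign) contributes exponentially to the average, while a confident correct one contributes almost nothing.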

In the following we present a simulation study where the dimension m is 
of the order of thousands, while the sample size n is of the order of hundreds. 

The simulation. We simulate $n$ i.i.d. vectors. Each is $M$-dimensional and
consists of $M$ i.i.d. $N(0, 1)$ random variables. Denote the $j$th component of
the $i$th vector by $X_j^i$.

For each vector $i$, $i = 1, \dots, n$, let $W^i$ be an $N(0, 0.25)$ random number
independent of $X_j^i$ and define

$Y^i = \mathrm{sign}\bigl(\frac{X_1^i + \cdots + X_{25}^i}{5} + W^i\bigr) \equiv \mathrm{sign}(V^i + W^i)$,

where $V^i$ is implicitly defined. Thus, the first 25 "explanatory variables"
(out of the $M$ available ones) are the relevant predictors for $Y^i$, and the
prediction should be through $V^i$.

Now we create, for each $i$, five additional random numbers (or simulated
explanatory variables), denoted $X^i_{M+1}, \dots, X^i_{M+5}$, as follows: $X^i_j = V^i + U^i_j$,
$j = M + 1, \dots, M + 5$; here $U^i_j \sim N(0, 9)$ are again independent of all the
others and of each other.
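
The data-generating design just described can be sketched as follows. The normalization of the sum of the first 25 variables is exposed as the parameter `scale`; its default of 5 (which makes the signal standard normal) is an assumption of this sketch, as are the function name `simulate` and the choice to return the five extra variables as the last five columns.

```python
import numpy as np

def simulate(n, M, rng, scale=5.0):
    """Generate n observations with m = M + 5 explanatory variables.

    Columns 0..M-1 are i.i.d. N(0, 1); the label depends on the first 25.
    Columns M..M+4 are noisy copies of the signal V (noise variance 9).
    `scale` is an assumed normalization of the 25-variable sum.
    """
    X = rng.standard_normal((n, M))
    V = X[:, :25].sum(axis=1) / scale
    W = rng.normal(0.0, 0.5, size=n)        # variance 0.25
    U = rng.normal(0.0, 3.0, size=(n, 5))   # variance 9
    y = np.sign(V + W)
    X_all = np.hstack([X, V[:, None] + U])
    return X_all, y
```

For example, `simulate(500, 1000, np.random.default_rng(0))` returns a 500 by 1005 design matrix and a label vector with entries in {-1, +1}.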

Notice we have $m = M + 5$ explanatory variables; only the first 25 are
relevant for predicting $Y^i$. Yet, if we may choose only a single explanatory
variable to base our prediction on, we would rather choose $X_j$ from the
group of the last five; obviously if we may choose as many as 25 or more, we
would choose the first 25.

Table 1
n = 500 and M = 1000 (m = 1005)

V-training   V-real   B1-l1-norm   B2-l1-norm   β-l1-norm   λ
0.132        2.365    7.409        0.444        22.422      0.01
0.361        0.850    3.985        0.291         9.757      0.03
0.538        0.810    2.277        0.270         5.030      0.05
0.673        0.817    1.430        0.240         2.767      0.07
0.742        0.850    0.825        0.277         1.540      0.09
0.815        0.860    0.499        0.246         0.887      0.11
0.859        0.880    0.243        0.242         0.523      0.13
0.877        0.895    0.142        0.229         0.379      0.15
0.887        0.902    0.084        0.224         0.311      0.17

Our indirect method of searching for the best subset is through
optimization under an $l_1$ constraint. Practically, the right constraint may be determined
by cross-validation or a test set. In our simulation study, the performance
of a predictor, obtained through such optimization under an $l_1$ constraint, was
tested on an independent sample of size 1000. In Tables 1-3 the average
prediction loss on the "data set"/"training set" is denoted V-training, while
the average on the additional independent sample of size 1000 is denoted
V-real.

Our optimization is conducted using "Lagrange multipliers," that is,
instead of optimization under an $l_1$ constraint, we optimize, for appropriate $\lambda > 0$,

$L_{\hat F}(\beta) + \lambda \sum_j |\beta_j|$.

We try various values of $\lambda$ that correspond to various constraints on the $l_1$
norm of $\beta$. The optimization is through steepest descent, where special care
is taken when computing the "partial derivative" of $\lambda \sum_j |\beta_j|$ for coordinates
$j$ where, for the current iteration, $\beta_j = 0$.
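
The text does not spell out how the zero coordinates are handled. One standard choice, shown in the sketch below as an assumption rather than as the procedure actually used in the simulation, is to take the minimum-norm subgradient: a coordinate at zero stays at zero unless the loss gradient there exceeds the penalty level in absolute value. The backtracking rule, the clipping of sign changes to zero, and all function names are likewise illustrative.

```python
import numpy as np

def penalized_exp_loss(beta, X, y, lam):
    """Empirical exponential loss plus lam * ||beta||_1."""
    return np.mean(np.exp(-y * (X @ beta))) + lam * np.abs(beta).sum()

def steepest_direction(beta, grad, lam):
    """Minimum-norm subgradient of the penalized objective."""
    d = grad + lam * np.sign(beta)
    zero = beta == 0
    # At beta_j = 0 the subdifferential is [grad_j - lam, grad_j + lam];
    # its element of smallest magnitude is 0 whenever |grad_j| <= lam.
    d[zero] = np.sign(grad[zero]) * np.maximum(np.abs(grad[zero]) - lam, 0.0)
    return d

def fit_l1_penalized(X, y, lam, n_iter=200, step0=1.0):
    n, m = X.shape
    beta = np.zeros(m)
    obj = penalized_exp_loss(beta, X, y, lam)
    for _ in range(n_iter):
        w = np.exp(-y * (X @ beta))
        grad = -(X.T @ (y * w)) / n          # gradient of the smooth part
        d = steepest_direction(beta, grad, lam)
        if not d.any():
            break                            # zero subgradient: stationary point
        step = step0
        while step > 1e-10:                  # backtracking line search
            cand = beta - step * d
            cand[beta * cand < 0] = 0.0      # stop at zero instead of crossing it
            cand_obj = penalized_exp_loss(cand, X, y, lam)
            if cand_obj < obj:
                beta, obj = cand, cand_obj
                break
            step *= 0.5
        else:
            break                            # no improving step found
    return beta
```

Running `fit_l1_penalized` over a grid of `lam` values and recording the $l_1$ norm of each solution gives the correspondence between penalty levels and constraints mentioned above.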

In Tables 1-3 we summarize simulation results for various $m$ and $n$. Only
for the case n = 500, M = 1000 is a detailed table given, with the
performance under various constraints. For the other cases, n = 100, M = 1000
and n = 500, M = 5000, only the performance under the optimal constraint
is given. Each row is based on averages of 20 repetitions for a fixed $\lambda$. In
the same table, different rows correspond to different $\lambda$, and the bigger $\lambda$
is, the more severe is the constraint. Indeed, one may see in Table 1 that as
$\lambda$ decreases the difference between V-training and V-real increases, that is,
the generalization power (or self-consistency property) is reduced. We record
the constraint also in terms of the $l_1$ norm of $\hat\beta$ in the column β-l1-norm.
The columns B1-l1-norm and B2-l1-norm record the $l_1$ norm of the first 25
and of the last five coordinates, respectively.

Table 2
n = 100 and M = 1000 (m = 1005)

V-training   V-real   B1-l1-norm   B2-l1-norm   β-l1-norm   λ
0.861        0.926    0.010        0.207        0.264       0.30

Table 3
n = 500 and M = 5000 (m = 5005)

V-training   V-real   B1-l1-norm   B2-l1-norm   β-l1-norm   λ
0.690        0.862    0.680        0.271        2.181       0.09

In practice, the column V-real will be replaced by evaluation of the
performance of the suggested predictor on a test set or by cross-validation (the
evaluation would be less accurate when the test set is smaller than the 1000
used in our simulation). Thinking of the V-real column as results from a
test set, we get the following. When there are only n = 100 observations
available, a test set would suggest predicting mainly based on the last five
explanatory variables, using $\lambda \approx 0.3$, with risk $\approx$ V-real = 0.926. Note,
the $l_1$ mass of the first 25 coefficients is only 0.01, while the $l_1$ mass of the
last five is 0.207. When there are n = 500 observations, a test set would
suggest $\lambda \approx 0.05$ with resulting risk about 0.81. Note, when n = 500, the $l_1$ mass
of the first 25 coefficients is 2.277, while that of the last five is only 0.27.
Indeed, with only 100 observations, the attempt to reveal the 25 "best"
explanatory variables is too ambitious and the procedure gives up on it and
settles for the inferior group of five. When the sample size is increased to
500, there is a shift toward the first 25 variables.
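
The selection rule described above, choosing the penalty level with the smallest held-out loss, can be illustrated directly on the Table 1 averages (n = 500, M = 1000). The numbers below are copied from that table; the variable names are assumptions for the example.

```python
# (lambda, V-training, V-real) rows of Table 1
table1 = [
    (0.01, 0.132, 2.365),
    (0.03, 0.361, 0.850),
    (0.05, 0.538, 0.810),
    (0.07, 0.673, 0.817),
    (0.09, 0.742, 0.850),
    (0.11, 0.815, 0.860),
    (0.13, 0.859, 0.880),
    (0.15, 0.877, 0.895),
    (0.17, 0.887, 0.902),
]

# Penalty level with the smallest held-out loss.
best_lambda, best_train, best_real = min(table1, key=lambda row: row[2])

# Gap between held-out and training loss, in increasing order of lambda.
gaps = [round(v_real - v_train, 3) for _, v_train, v_real in table1]
```

Here `best_lambda` is 0.05, and `gaps` shrinks steadily as the penalty grows, from 2.233 at the weakest penalty to 0.015 at the strongest.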

Comparing the simulated results with M = 1000 to those with M = 5000,
we see that by screening in advance many superfluous explanatory variables,
reducing from m = 5005 to m = 1005, we hardly improve. In the case m =
5005 the best value is attained when $\lambda = 0.09$ and equals V-real = 0.862;
in the case m = 1005 the best value is attained when $\lambda = 0.05$ and equals
V-real = 0.810. The improvement is 0.052. One could argue that this
improvement might be significant when compared to the risk magnitudes,
0.810 and 0.862. As remarked in the Introduction, when the risk is small (or
approaches 0), a more delicate analysis of rates of convergence, rather than
only persistence, is desired.

Note, however, that the slight advantage demonstrated when successfully
screening out 4000 superfluous explanatory variables (in our simulation,
changing m from 5005 to 1005) seems to occur in the "twilight zone," that
is, the zone where the constraint is not severe enough to produce estimators
with generalization power (or that are self-consistent). Compare in Table 1,
for the optimal constraint $\lambda = 0.05$, V-real = 0.81 with V-training = 0.538.
Such a "twilight zone" could be very abrupt in very high dimensions. Moving
further from that zone will introduce singularity and the selected predictors
will be totally unreliable.

Acknowledgment. I am grateful to Anirban DasGupta for comments 
that led to a better presentation. 